Abstract

Finding a location for a new facility such that the facility attracts the maximal 
number of customers is a challenging problem. Existing studies either model cus- 
tomers as static sites and thus do not consider customer movement, or they focus 
on theoretical aspects and do not provide solutions that are shown empirically to be 
scalable. Given a road network, a set of existing facilities, and a collection of cus- 
tomer route traversals, an optimal segment query returns the optimal road network 
segment(s) for a new facility. We propose a practical framework for computing this 
query, where each route traversal is assigned a score that is distributed among the 
road segments covered by the route according to a score distribution model. The 
query returns the road segment(s) with the highest score. To achieve low latency, 
it is essential to prune the very large search space. We propose two algorithms that 
adopt different approaches to computing the query. Algorithm AUG uses graph 
augmentation, and ITE uses iterative road-network partitioning. Empirical studies 
with real data sets demonstrate that the algorithms are capable of offering high 
performance in realistic settings. 

1 Introduction 

The problem of finding a location for a new facility with respect to given sets of cus- 
tomer locations and existing facilities, known as the facility location problem IfTUTOl 
19,22-25], has applications in the strategic planning of resources (e.g., hospitals, gas 
stations, banks, ATMs, billboards, and retail facilities) in both the public and private 
sectors [15, 16]. The literature contains a line of studies that use the residences of
consumers as the customer locations [22, 23, 25]. However, customers do not remain
stationary at their residences, but rather travel, e.g., to work. Consumers are not only 
attracted to facilities according to the proximity of these to their residences. 

Another line of study H}|6) considers the flow intercepting facility location prob- 
lem, where the goal is to identify a location that intercepts the most flow from moving 
customers. Flows are made up by pre-planned customer trips, and the idea is that cus- 
tomers can choose to interrupt their trip to receive a service from a facility at a nearby 
location. In its original formulation, the problem is to maximize the flow in a network
by placing m new facilities while disregarding existing facilities. Studies of
this problem have a theoretical focus and do not aim to provide scalable solutions.
Thus, the largest study considers spatial networks with up to 1,000 nodes [1]. Real
spatial networks for even small regions are much larger. Another difficulty is to obtain
real flow data. This led to the development of probabilistic methods [6].

The increasing availability of moving-object trajectory data, e.g., as GPS traces, 
calls for a new study of the facility location problem that takes into account the real 
movements of the customers that are now available and that provides practical solutions 
that apply in realistic settings. 

We study the optimal segment problem. Given a road network G, a set of facilities 
F, a set of route traversals R, each of which can be taken by different users multiple 
times, the objective is to find the optimal road segments such that a new facility on any 
of these segments attracts the maximum number of route traversals. A route traversal 
is attracted by a facility if the distance between the route and the facility is within a 
given threshold. 

Figure 1 shows an instance of the problem. Solid lines and dots form the road network.
Hollow circles are existing facilities ($f_1$, $f_2$, $f_3$, and $f_4$). Dashed lines indicate
route traversals ($r_1$, $r_2$, and $r_3$). We draw them next to the roads for clarity. The gray
bar that covers $f_3$ indicates that $r_3$ is attracted by $f_3$ because $f_3$ is within distance $\delta$
of one of the end points of $r_3$. The rationale is that a facility that is sufficiently near a
route will attract customers who follow the route. Therefore, the ends of each route are
extended by distance $\delta$.




With $\delta$, $r_1$ starts and ends at A and D, respectively. Route $r_2$ starts and ends at $v_1$
and B, respectively. Route $r_3$ starts and ends at C and H, respectively. Assume that
each of the routes is traversed by one customer exactly once. Intuitively, the optimal 
segment for a new facility is the segment AH because this segment attracts the most 
route traversals (in this example, three). 

We propose a framework to solve the optimal segment problem. In the framework, 
each route traversal is assigned a score, and that score is distributed among the road 
network segments covered by the traversal. The scoring of segments is based on three 
factors: the number of customers who take the route (the count), the number of traver- 
sals by each customer (the usage), and the length of the route. 

Intuitively, road segments that are covered by many route traversals and that are attracted
by few existing facilities are good result candidates. But customers of different
types of businesses can have different spatial preferences with respect to the businesses 
they are likely to visit. For example, customers may prefer grocery stores near their 
homes or work places, but may have equal probability to visit clothing stores along 
the routes they travel. To accommodate such preferences, we support different func- 
tions for the assignment of scores to the routes that customers follow as well as allow 
different models for the distribution of scores to the underlying segments. 

The framework encompasses two optimal segment algorithms. The first, AUG, uses 
graph augmentation, the idea being to augment the set of vertices of the original road 
network graph with the facilities and the start and end points of the route traversals. 
Each vertex in the new graph records a list of attracted routes. The score of an edge is 
the sum of scores of the route traversals that cover both vertices of an edge. The edges 
with the highest score are mapped back to the original graph and are possibly extended 
into longer segments. 

The second, ITE, uses a heap to prioritize the most promising road segments, and it 
iteratively partitions and scores these based on intersecting routes. ITE keeps partitioning
the road segments that most likely contain an optimal subsegment until an optimal
subsegment is obtained. Then it extends the partial optimal segment to its full length 
and adds it to the result set. 

In summary, the contribution is fourfold: 

• Formalization of the new optimal segment problem. 

• A framework that accommodates different scoring functions and score distribu- 
tion models. 

• Two algorithms, AUG and ITE, that solve the problem. 

• Coverage of an empirical study that indicates that AUG and ITE are efficient in 
realistic settings. 

The remainder of the paper is structured as follows. Section 2 formalizes the problem
setting. Section 3 presents a preprocessing procedure that is used by both of the
two proposed algorithms. We describe in detail algorithm AUG and provide a theoretical
analysis in Section 4. We then describe in detail algorithm ITE and give an
accompanying theoretical analysis in Section 5. Section 6 reports the results of an empirical
evaluation of the proposed algorithms. Section 7 reviews existing work. Finally,
Section 8 concludes.

2 Definitions 

We proceed to model the road network and formulate the optimal segment problem 
along with supporting definitions. 

2.1 Road Network Modeling 

A road network is modeled as a spatially embedded graph $G = (V, E)$, where $V$ is a set
of vertices, and $E$ is a set of edges that connect ordered pairs of vertices. Every vertex
$v_i$ has $(x_i, y_i)$ coordinates in 2D space, denoted as $loc(v_i) = (x_i, y_i)$. We use either $e_{i,j}$
or $(v_i, v_j)$ to refer to the directed edge from vertex $v_i$ to vertex $v_j$. The length of an edge is
defined as the Euclidean distance between its two vertices: $\|e_{i,j}\| = \|loc(v_i), loc(v_j)\|$.
Vertices and edges are assigned unique identifiers. In this model, an edge between two
vertices represents a part of a road. The polyline obtained by connecting the vertices
of consecutive edges approximates the center line of part of a road.

We use the term network point to refer to a point location anywhere on an edge. 

Definition 1 (Network Point) A road network point $p$ is defined as $p = (eid, d)$, where
$eid$ is the identifier of an edge $e = (v_i, v_j)$ and $d$ ($0 \le d \le 1$) is the ratio of the
distance between vertex $v_i$ and the point to the length of $e$. $P$ denotes the set of all
network points on the road network.

It can be seen that given an edge $(v_i, v_j)$ identified by $eid$, $v_i = (eid, 0)$ and
$v_j = (eid, 1)$. Therefore, we have $V \subseteq P$. For example, in Figure 1, the network point
of $f_1$ is $f_1.p = (e_{2,3}, 0.5)$. The distance between two network points $p_i$ and $p_j$ on the
same edge $e$ is defined as $dist(p_i, p_j) = \|e\| \cdot |d_i - d_j|$.
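
The network point representation and the same-edge distance can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name `NetworkPoint`, the function name `dist_same_edge`, and the edge length of 10 units are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkPoint:
    """A point on an edge: eid identifies the edge, and d in [0, 1] is the
    fraction of the edge length measured from the edge's start vertex."""
    eid: str
    d: float

    def __post_init__(self):
        if not 0.0 <= self.d <= 1.0:
            raise ValueError("d must lie in [0, 1]")


def dist_same_edge(p_i, p_j, edge_length):
    """dist(p_i, p_j) = ||e|| * |d_i - d_j| for two points on the same edge."""
    if p_i.eid != p_j.eid:
        raise ValueError("points must lie on the same edge")
    return edge_length * abs(p_i.d - p_j.d)


# Facility f_1 sits at (e_{2,3}, 0.5); the edge length 10.0 is hypothetical.
f1 = NetworkPoint("e2,3", 0.5)
v2 = NetworkPoint("e2,3", 0.0)   # start vertex of the edge, i.e., (eid, 0)
print(dist_same_edge(v2, f1, edge_length=10.0))  # 5.0
```

Points on different edges require a shortest-path computation on the network, which this sketch deliberately leaves out.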

A road segment is a polyline that starts at a network point, traverses a sequence of 
vertices, and ends at a network point. 

Definition 2 (Road Segment) A road segment $s$ is defined as a sequence of network
points, $s = \langle p_1, p_2, \ldots, p_n \rangle$, where $n \ge 2$, $p_1, p_n \in P$, $p_i \in V$ ($1 < i < n$),
$p_1.eid = p_2.eid$, $p_{n-1}.eid = p_n.eid$, and $(p_i, p_{i+1}) \in E$ ($1 < i < n - 1$).
The length of $s$ is the network distance from $p_1$ to $p_n$.



The set of road segments is denoted as S. 

It follows from the definition that an edge is also a segment, i.e., $E \subseteq S$. Further, we
use the term route for a segment that a customer has traversed.

When there is no ambiguity from the context, we use AB to mean the segment 
between network points A and B. For example, the short segment between A and H 
in Figure 1 is AH. Otherwise, we write the segment in full, e.g., the road segment
$\langle p_{f_3}, v_5, v_6, p_{f_4} \rangle$ between facilities $f_3$ and $f_4$, supposing the network points for facilities
$f_3$ and $f_4$ are $p_{f_3}$ and $p_{f_4}$, respectively.

2.2 Facilities and Route Usage 

A facility $f$ located at a network point $p$ is denoted as $(fid, p)$, where $fid$ identifies the
facility. F denotes the set of all facilities. 

A route is a segment and thus starts at a network point, traverses a sequence of 
connected edges, and stops at a network point. The same route can be traversed many 
times by the same or many customers. For instance, many customers who live in the 
same building may take the same route $r$ to the same grocery store. We use $count$ to
denote the number of customers who take $r$. On the other hand, one customer can take
the same route many times, e.g., a customer may take the same route from home to
work on most weekdays. We use $usage_i$ to denote the number of times $r$ is taken by
customer $i$ ($1 \le i \le count$).

Definition 3 (Route Usage Object) A route usage object $ro$ is defined as

$ro = (rid, r, count, \langle usage_1, \ldots, usage_{count} \rangle)$

where $rid$ identifies the object and $r \in S$ is a segment traversed by the user. $R$ is the set of
all route usage objects.

A route $ro.r$ covers a road segment $s'$ if $\forall p \in s' (p \in ro.r)$. A route $ro.r$ intersects
a segment $s'$ if $\exists p \in s' (p \in ro.r)$. The set of route usage objects whose routes cover $s'$
is denoted as $s'.C$. The set of route usage objects whose routes intersect $s'$ is denoted
as $s'.I$. It is straightforward to see that $s'.C \subseteq s'.I$. We also say that $ro_1 = ro_2$ if
$ro_1.r = ro_2.r$.
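
To make the cover and intersect relations concrete, the following sketch restricts attention to a single edge, so that every route and segment reduces to an interval of the ratio $d$. This restriction, the function names, and the positions chosen for A, H, and D on the edge are assumptions made for illustration; the general case needs the full polyline of each route.

```python
def covers(route, seg):
    """True if every point of seg lies on route (interval containment)."""
    r_lo, r_hi = route
    s_lo, s_hi = seg
    return r_lo <= s_lo and s_hi <= r_hi


def intersects(route, seg):
    """True if route and seg share at least one point (interval overlap)."""
    r_lo, r_hi = route
    s_lo, s_hi = seg
    return max(r_lo, s_lo) <= min(r_hi, s_hi)


# Hypothetical positions on one edge: A = 0.2, H = 0.4, f_1 = 0.5, D = 0.7.
routes = {
    "r1": (0.2, 0.7),  # from A to D
    "r2": (0.0, 1.0),  # passes over the whole edge
    "r3": (0.0, 0.4),  # enters at the start vertex and ends at H
}

segment = (0.4, 0.5)   # the piece between H and f_1
C = {rid for rid, r in routes.items() if covers(r, segment)}
I = {rid for rid, r in routes.items() if intersects(r, segment)}
print(sorted(C))  # ['r1', 'r2']
print(sorted(I))  # ['r1', 'r2', 'r3']
print(C <= I)     # True
```

Route r3 only touches the segment at H, so it intersects the segment without covering it, which is why the cover set is a subset of the intersect set here.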

In Figure 1, the routes $r_1$, $r_2$, and $r_3$ are traversed by three different customers. We
assume that each route is traversed once by each customer.

A route $r$ is attracted by a facility $f$, and $f$ is an attractor for $r$, if $dist_G(f.p, ro.r) \le \delta$,
where $dist_G(p, s)$ gives the shortest network distance between a network point $p$ and
a segment $s$ and $\delta$ is the distance threshold that was introduced earlier. Note that the
same facility can attract several routes. In Figure 1, facility $f_1$ attracts routes $r_1$ and $r_2$,
and $f_3$ attracts $r_3$.

2.3 Scoring a Route 

In the optimal segment problem, route traversals play the role that customer locations 
play in the classical formulation of the optimal location problem. Thus, we need to 
decide how to assign a score to a route based on the traversals of the route. The scoring 
of a route is thus based on three factors that are all captured in the route usage object for 
the route: the number of customers taking the route, the number of traversals by each 
customer, and the length of the route. The route's score is subsequently distributed 
among the segments covered by the route. The intuition of distributing the score of 
a route to its segments is that when a customer traverses the route, the customer may 
visit facilities located on segments along the route. 

To ensure that the framework yields meaningful results, the scores eventually as- 
signed to segments must be invariant under the splitting and concatenation of route 
usage objects. To achieve this, we require the following property to hold. 

Route Scoring Property A route scoring function should be independent of
the partitioning of the route of a route usage object. Let $ro_1 \circ ro_2$ be the concatenation
of $ro_1.r$ and $ro_2.r$. Let $ro = (id, r_1 \circ r_2 \circ \cdots \circ r_m, count, u)$ and
$ro_i = (id_i, r_i, count, u)$ ($1 \le i \le m$). Then we require $score(r) = \sum_{i=1}^{m} score(r_i)$.

This property ensures that partitioning a route usage object does not change the 
total score that is available for assignment to segments. 

Many scoring functions are possible that satisfy the property. Next, we show two 
of them. 

Definition 4 (Scoring a Route) Let a route $r$ with an associated route usage object
$ro = (rid, r, count, \langle usage_1, \ldots, usage_{count} \rangle)$ be given. Then the score of $r$ can be
defined as follows:

$score_{all}(r) = length(r) \cdot \sum_{i=1}^{count} ro.usage_i$

$score_{cap\text{-}x}(r) = length(r) \cdot \sum_{i=1}^{count} \min(ro.usage_i, x)$

where $x$ is a user-defined value.

Depending on the products or services offered by a facility, different scoring func- 
tions may be appropriate. For example, a facility that sells everyday necessities (e.g., a 
bakery) may attract the same customer on each route traversal by the customer. Thus, 
$score_{all}$ is appropriate. In contrast, if a store sells products that are bought less frequently
(e.g., a furniture store), the store may not benefit from a large number of traversals
by the same customer, making $score_{cap\text{-}x}$ more appropriate. Thus, we keep the
framework open to the use of different scoring functions.

Unless specified otherwise, we use the function $score_{all}$ for illustration.

In Figure 1, assume that the lengths of routes $r_1$, $r_2$, and $r_3$ are 2, 4, and 3, and that
the numbers of traversals per customer are $\langle 2, 2 \rangle$, $\langle 2, 1 \rangle$, and $\langle 2 \rangle$, respectively. Then
we have three route usage objects: $ro_1 = (id_1, r_1, 2, \langle 2, 2 \rangle)$, $ro_2 = (id_2, r_2, 2, \langle 2, 1 \rangle)$,
and $ro_3 = (id_3, r_3, 1, \langle 2 \rangle)$. The score of $r_1$ is calculated as follows:
$score(r_1) = length(r_1) \cdot (ro_1.usage_1 + ro_1.usage_2) = 2 \cdot (2 + 2) = 8$. Similarly, we calculate the
scores of $r_2$ and $r_3$: $score(r_2) = 12$ and $score(r_3) = 6$.
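
The two scoring functions can be sketched in a few lines. The class `RouteUsage` is a simplified stand-in for a route usage object (it stores only the route length instead of the route geometry), and the names `score_all` and `score_cap` are chosen for this example; the numbers are those of the running example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RouteUsage:
    rid: str
    length: float           # length(r); the geometry of r is omitted here
    usage: Tuple[int, ...]  # usage_i per customer, so count = len(usage)


def score_all(ro):
    """Every traversal counts: length(r) * sum_i usage_i."""
    return ro.length * sum(ro.usage)


def score_cap(ro, x):
    """At most x traversals count per customer: length(r) * sum_i min(usage_i, x)."""
    return ro.length * sum(min(u, x) for u in ro.usage)


ro1 = RouteUsage("r1", 2, (2, 2))
ro2 = RouteUsage("r2", 4, (2, 1))
ro3 = RouteUsage("r3", 3, (2,))
print([score_all(ro) for ro in (ro1, ro2, ro3)])  # [8, 12, 6]
print(score_cap(ro2, 1))                          # 8

# Route scoring property: splitting r2 into two pieces with the same usage
# (hypothetical lengths 1.5 and 2.5) leaves the total score unchanged.
parts = [RouteUsage("r2a", 1.5, ro2.usage), RouteUsage("r2b", 2.5, ro2.usage)]
print(sum(score_all(p) for p in parts) == score_all(ro2))  # True
```

Both functions are linear in the route length, which is what makes them satisfy the route scoring property.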

2.4 Score Distribution Models 

A score distribution model determines how to distribute the score of a route to the 
underlying segments. 

Intuitively, segments covered by a route with many traversals that are not attracted 
by many other facilities are good candidates for placing a new facility. Therefore, they 
should be assigned high scores. But customers can have different spatial preferences 
for visiting different kinds of businesses. Therefore, we leave the framework open to 
the use of different score distribution models. 

When $n$ facilities are located on a segment, they partition the segment into $k$ subsegments,
where $k$ is one of $n - 1$, $n$, or $n + 1$ depending on whether two, one, or no
facilities are located at the ends of the segment.

The following are example score distribution models. 

• Equal weight is assigned to each subsegment. In this model, the score of a route $r$
is distributed such that a customer has an equal probability to visit any business
along the route, for example, any clothing store on the way back home. The
score assigned to the $i$th subsegment $s_i$ ($1 \le i \le k$) is $\frac{1}{k} \cdot score(r)$.

• Decreasing/increasing weights are assigned to the subsegments. The score of
the $i$th subsegment $s_i$ ($1 \le i \le k$) is given by $\frac{1}{2^i} \cdot \frac{1}{\sum_{j=1}^{k} \frac{1}{2^j}} \cdot score(r)$.
This definition gives exponentially decreasing scores to subsegments
and normalizes the scores such that the full score of the route is distributed.
This model indicates a preference for the facilities at the beginning of the route.
Symmetrically, there is a model that prefers the facilities at the end of the route.
For example, a customer might prefer to have a meal before the trip back home
or to work, but it is also possible (with lower probability) that the customer will
visit any restaurant along the route.

• All of the score is evenly distributed to the first and the last subsegment. This
model indicates that customers consider only businesses that are located nearest
to the start and end of the route. For example, a customer may prefer to buy dairy
products at the store closest to home, whereas regular items can be bought at the
store closest to the workplace.

• The original facility location problem considers simply the attraction of customer
locations to facility locations. In our setting, where customer route traversals are 
attracted to segments where facilities may be placed, the model that assigns the 
entire score of a route traversal to the route's first subsegment may be the one 
that most closely resembles the original problem. 

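In Figure 1, route $r_2$ is attracted by facilities $f_1$ and $f_2$, and $k = 3$. In previous
examples, we showed that the score of $r_2$ is 12. According to the first proposed
score distribution model, each subsegment ($v_1 f_1$, $f_1 f_2$, and $f_2 B$) receives score
$\frac{score(r_2)}{k} = \frac{12}{3} = 4$. According to the second model,
$\frac{1}{\sum_{j=1}^{3} \frac{1}{2^j}} = \frac{1}{\frac{1}{2} + \frac{1}{4} + \frac{1}{8}} = \frac{8}{7}$, so
$score(v_1 f_1) = \frac{1}{2} \cdot \frac{8}{7} \cdot 12 = \frac{4}{7} \cdot 12$, $score(f_1 f_2) = \frac{1}{4} \cdot \frac{8}{7} \cdot 12 = \frac{2}{7} \cdot 12$, and
$score(f_2 B) = \frac{1}{8} \cdot \frac{8}{7} \cdot 12 = \frac{1}{7} \cdot 12$.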
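
The example score distribution models can be expressed as weight vectors over the $k$ subsegments of a route. The sketch below is one possible reading of the models, using exact fractions; all function names are assumptions made for this illustration, and the decreasing model uses weights proportional to $1/2^i$ as in the worked example.

```python
from fractions import Fraction


def num_subsegments(n_facilities, n_at_ends):
    """k = n + 1, n, or n - 1 when 0, 1, or 2 facilities sit at the segment ends."""
    return n_facilities + 1 - n_at_ends


def equal_weights(k):
    """Model 1: every subsegment receives the share 1/k."""
    return [Fraction(1, k)] * k


def decreasing_weights(k):
    """Model 2: shares proportional to 1/2^i, normalized to sum to 1."""
    raw = [Fraction(1, 2 ** i) for i in range(1, k + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def first_and_last_weights(k):
    """Model 3: the score is split evenly between the first and last subsegment."""
    if k == 1:
        return [Fraction(1)]
    return [Fraction(1, 2)] + [Fraction(0)] * (k - 2) + [Fraction(1, 2)]


def distribute(score, weights):
    """Scores assigned to the subsegments, in route order."""
    return [w * score for w in weights]


k = num_subsegments(n_facilities=2, n_at_ends=0)  # r_2 with f_1 and f_2
print(k)                                                        # 3
print([str(s) for s in distribute(12, equal_weights(k))])       # ['4', '4', '4']
print([str(s) for s in distribute(12, decreasing_weights(k))])  # ['48/7', '24/7', '12/7']
```

The increasing variant is obtained by reversing the list returned by `decreasing_weights`. Each weight vector sums to 1, so the full route score is always distributed.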

So far, we have distributed the score assigned to a single route to the segments 
covered by the route. However, a segment s may be covered by multiple routes that 
assign score to the segment. The total score of the segment, $score_M(s)$, where $M$
indicates the score distribution model used, is simply the sum of these scores. Similarly,
we can calculate the score of a network location $p$, $score_M(p)$.

We show how to score the subsegment AH in Figure 1 using the first proposed
model. AH is covered by all of the three route usage objects. So

$score_M(AH) = \sum_{i=1}^{3} \frac{score(ro_i.r)}{k_i} = \frac{8}{2} + \frac{12}{3} + \frac{6}{2} = 11$.

Since our framework is generic w.r.t. score distribution models, unless specified 
otherwise, we use the first model for illustration. 

2.5 Problem Formulation 

With the above definitions in place, we can define the optimal segment query. 

Definition 5 (The Optimal Segment Query) The optimal segment query finds every segment
$s_{opt}$ from a road network $G$ such that

1. $\forall p_1, p_2 \in s_{opt}$ ($score_M(p_1) = score_M(p_2)$)

2. $\forall p \in s_{opt}\, \forall p' \in P$ ($score_M(p') \le score_M(p)$)

3. $\nexists s' \in S$ ($s_{opt} \subset s'$ and 1 and 2 hold for $s'$).

This definition ensures that every point in the optimal segment has the same score, 
that the score is optimal, and that the optimal segment is maximal. 

The notation introduced in this section and used throughout the paper is summarized
in Table 1.



Notation    Description
R           The set of route usage objects
F           The set of facilities
S           The set of road segments
P           The set of sites
G, G'       The (augmented) road network graph
V, V'       The set of vertices in G, G'
E, E'       The set of edges in G, G'
n           The total number of GPS points in R
$\delta$    The maximum distance of attraction

Table 1: Summary of Notation



3 Preprocessing 

A straightforward approach to compute the optimal segment query is to enumerate and 
score all possible segments and then return the one with the highest score. However, 
this is not feasible as there is an infinite number of possible segments. Thus, different 
approaches are needed. 

The two algorithms we propose both rely on the same preprocessing algorithm, 
which we present here. This algorithm determines the relationships between the facili- 
ties and the edges, between the routes and the edges, and between the facilities and the 
routes. It needs to be run only once for one set of routes. 

The algorithm makes each edge record its facilities and route start and end points, 
if any. It also makes each vertex record the covering routes' identifiers. The routes 
record the facilities they cover. The facilities record the edge they are located on and 
the covering routes, if any. Also the algorithm populates a lookup table so that given 
an edge, one can quickly determine the routes that intersect with the edge. 

Recall that $G$ is the spatially embedded graph, $f$ is a facility, and $r$ is a route.
Algorithm PreProcess calls getEdge($f$, $G$) to retrieve the edge where $f$ is located. It
also calls getEdges($r$, $G$, $\delta$) to retrieve the set of edges that intersect $r$.

The PreProcess procedure is presented in Algorithm 1 and explained next.

A facility $f$ keeps the edge where it is located in the variable $f.e_c$, and the set of
routes it attracts in $f.R_c$. An edge $e$ keeps a set of route start and end network points
that are located on $e$ in $e.O_c$. This list is used by the AUG algorithm for augmentation
purposes. Each vertex $v$ of $e$ maintains a list of routes that it attracts in $v.R_c$, and $v$'s
relative positions in $v.L$. These two lists are used later by AUG for scoring purposes.
$e.F_c$ and $e.R_c$ are the set of facilities that are located on $e$ and the set of routes that
intersect edge $e$, respectively. A route $r$ keeps its set of attracting facilities in $r.F_c$.

For each facility $f$, PreProcess retrieves its edge $e$ so that $e$ adds $f$ to its set of
facilities $e.F_c$, and $f.e_c$ is set to that edge (lines 1-2).



Algorithm 1: PreProcess(G, R, F, δ)

 1  foreach f ∈ F do
 2      e ← getEdge(f, G); e.F_c.add(f); f.e_c ← e;
 3  foreach ro ∈ R do
 4      r ← ro.r;
 5      r.E_c ← getEdges(r, G, δ);
 6      foreach e = (v_s, v_e) ∈ r.E_c do
 7          if (r.p_s.eid = e.eid) ∧ (r.p_s.d ∉ {0, 1}) then e.O_c.add(r.p_s);
 8          if (r.p_e.eid = e.eid) ∧ (r.p_e.d ∉ {0, 1}) then e.O_c.add(r.p_e);
 9          e.R_c.add(r);
10          if contains(r, e) then
11              r.F_c ← r.F_c ∪ e.F_c;
12              foreach f ∈ e.F_c do f.R_c.add(r);
13          else if intersects(r, e) then
14              F' ← {f | f ∈ e.F_c ∧ attracts(f, r, δ)};
15              r.F_c ← r.F_c ∪ F';
16              foreach f ∈ F' do f.R_c.add(r);
17          if attracts(v_s, r, δ) then
18              v_s.R_c.add(r); i ← the position of v_s relative to the r.F_c; v_s.L.add(i);
19          if attracts(v_e, r, δ) then
20              v_e.R_c.add(r); i ← the position of v_e relative to the r.F_c; v_e.L.add(i);



Next, for each route $r$, the set of intersected edges is retrieved, and the $r.E_c$ field is
updated (line 5). Then, if the start network point of $r$ is not a vertex in $G$, it is added
to the $e.O_c$ set of the edge $e$ where it is located. Similarly, $r$'s end network point is
added to an $e.O_c$ set (lines 7-8). Route $r$ is also added to the list $e.R_c$ (line 9). For
each edge $e$ covered by the route $r$, the facilities and $r$ record each other (lines 10-12).
For each edge $e$ intersected by a route $r$, on the other hand, the attraction relationship
between the facilities and $r$ is determined before updating each other's corresponding
field (lines 13-16). Next, each vertex of $e$ records $r$ in the $v.R_c$ list if $r$ is attracted
by it. In addition, the relative position of $v$ is also kept in $v.L$ for scoring purposes
(lines 17-20).

In Figure 1, edge $e_{2,3}$ has $e_{2,3}.F_c = \{f_1\}$ and $f_1.e_c = e_{2,3}$. Route $r_1$ traverses one
edge and is attracted by one facility, so $r_1.E_c = \{e_{2,3}\}$ and $r_1.F_c = \{f_1\}$. Edge $e_{2,3}$ is
covered by $r_1$, $r_2$, and $r_3$, so $e_{2,3}.R_c = \{r_1, r_2, r_3\}$. Three start or end network points
of the routes are located on edge $e_{2,3}$, so $e_{2,3}.O_c = \{A, D, H\}$. Vertex $v_2$ is covered
by $r_2$ and $r_3$, and vertex $v_3$ is covered by $r_2$, so $v_2.R_c = \{r_2, r_3\}$ and $v_3.R_c = \{r_2\}$.
Facility $f_1$ attracts $r_1$ and $r_2$, so $f_1.R_c = \{r_1, r_2\}$.
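
The bookkeeping performed by the preprocessing step can be sketched as below. This is a simplified illustration rather than Algorithm 1 itself: it assumes that facilities and routes are already matched to edge identifiers, takes the attraction test as a caller-supplied function, and omits the $e.O_c$, $v.R_c$, and $v.L$ fields. The edge identifiers other than `e2,3` and the location of `f2` are hypothetical.

```python
from collections import defaultdict


def preprocess(facility_edge, route_edges, attracts):
    """Build the edge/route/facility lookup tables.

    facility_edge: fid -> eid of the edge where the facility is located
    route_edges:   rid -> list of eids that the route intersects
    attracts:      function (fid, rid) -> bool, the distance test (assumed given)
    """
    edge_facilities = defaultdict(set)   # plays the role of e.F_c
    edge_routes = defaultdict(set)       # plays the role of e.R_c
    route_facilities = defaultdict(set)  # plays the role of r.F_c
    facility_routes = defaultdict(set)   # plays the role of f.R_c

    for fid, eid in facility_edge.items():
        edge_facilities[eid].add(fid)

    for rid, eids in route_edges.items():
        for eid in eids:
            edge_routes[eid].add(rid)
            for fid in edge_facilities.get(eid, ()):
                if attracts(fid, rid):
                    route_facilities[rid].add(fid)
                    facility_routes[fid].add(rid)

    return edge_facilities, edge_routes, route_facilities, facility_routes


# Hypothetical encoding of part of the running example.
facility_edge = {"f1": "e2,3", "f2": "e3,4"}
route_edges = {
    "r1": ["e2,3"],
    "r2": ["e1,2", "e2,3", "e3,4"],
    "r3": ["e1,2", "e2,3"],
}
attracted_pairs = {("f1", "r1"), ("f1", "r2"), ("f2", "r2")}

e_F, e_R, r_F, f_R = preprocess(
    facility_edge, route_edges, lambda f, r: (f, r) in attracted_pairs
)
print(sorted(e_R["e2,3"]))  # ['r1', 'r2', 'r3']
print(sorted(f_R["f1"]))    # ['r1', 'r2']
print(sorted(r_F["r2"]))    # ['f1', 'f2']
```

The nested loops visit each route's edges once, which mirrors the $O(|F| + |R||E_m|)$ structure of the procedure when the number of facilities per edge is small.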

The following lemma states the time complexity of PreProcess. 

Lemma 1 Algorithm PreProcess has time complexity $O(|F| + |R||E_m|)$, where $|E_m|$
is the maximal number of edges that any route traverses.

Proof 1 The first loop in the algorithm takes time $O(|F|)$. In the second loop, the
outer loop runs $O(|R|)$ times. The inner loop depends on the number of edges that a
route traverses. Let $|E_m|$ be the maximal number of edges that any route traverses.
Then the second loop has time complexity $O(|R||E_m|)$. In total, the time complexity is
$O(|F| + |R||E_m|)$.

4 Graph Augmentation 

4.1 Overview 

The main idea of the graph augmentation algorithm (AUG) is to augment the road 
network graph G with the facilities and the first and the last network points of each 
route. In the augmented graph $G' = (V', E')$, it is guaranteed that each route starts
from a vertex and ends at a vertex. Meanwhile, each vertex in $G'$ stores the identifiers
of the covering routes.

Then each edge's score in $G'$ can be calculated by summing up the scores distributed
by the routes that cover both end points. The score contributed by a route
is calculated based on the specific score distribution model used, as discussed in
Section 2.4.

Next, AUG examines every edge in G' with a score, and identifies the edges with 
the highest score (the optimal edges). 

Finally, the algorithm maps the optimal edges back to the original graph G, where 
they are segments. Then AUG merges connected segments, if any, to form maximal 
segments, and returns them as the result. 

Figure 2 illustrates the graph in Figure 1 after being augmented with routes $r_1$, $r_2$,
and $r_3$ and facilities $f_1$, $f_2$, and $f_3$. Note that each vertex in the augmented graph has
a list of the identifiers of the routes that cover the vertex. We use $AL(v_i)$ to denote the
attraction list of $v_i$. Intersecting the sets of two adjacent vertices gives the routes that
cover the edge, whose score can then be calculated according to a score distribution
model. For example, $AL(A) = \{r_1, r_2, r_3\}$ and $AL(H) = \{r_1, r_2, r_3\}$. So, the set
of routes that cover edge $e_{A,H}$ is $AL(A) \cap AL(H) = \{r_1, r_2, r_3\}$. Then the score of
$e_{A,H}$ is calculated based on the score distribution model used.

Next, AUG finds the edges with the highest score by examining all edges in the 
augmented graph. These edges are then mapped back to the original graph, and become 
road segments, which are possibly merged into longer segments. These segments are 
returned as the result. In Figure 2, after edge $e_{A,H}$ is identified as the optimal edge
with the highest score, it is mapped back to the original graph, and the segment AH is
returned as the result.

Figure 2: The Augmented Road Network Graph



4.2 The AUG Algorithm 

Algorithm 2 presents details of the AUG algorithm. The set of edges that have network
points either from facilities or routes is obtained (lines 1-2). Graph G' is obtained by 
augmenting graph G with the network points of F and R (line 3). Note that some 
network points of routes or facilities may happen to be vertices. These network points 
are excluded from being augmented into G. Then AUG updates the covering routes of 
the newly added vertices (lines 4-6). It initializes the result set S and the highest score 
seen so far, optS (line 7). In the next loop (lines 8-14), the scores of the edges in G' 
are calculated according to the score distribution model used (line 10), and the optimal 
edges in G' are identified and stored in S. In lines 15-17, each edge in G' is mapped 
back to G. Mapping back to the original graph is a trivial task. Recall that each new 
network point has an eid field that helps identify the original edge. If a segment can 
be extended (i.e., the neighboring segment is also an optimal segment), it is extended 
(line 17). Finally, the result set S is returned (line 18). 

This process has two implications. First, an edge in G may be split into several 
edges in G'. After the optimal edges are identified in G', they must be mapped back to 
G. Second, for an edge in G', a route either covers it or does not cover it. The partial 
intersection relationship between a route and an edge is eliminated in $G'$.

It can be seen that it is sufficient to just augment the original graph with the start 
point and the end point of each route for finding the optimal segments because the 
internal points in a route are vertices in G. 

The algorithm splits some road segments. 

• If the two end points of a route do not happen to be vertices in G, they are added 
as new vertices into the road network, as they are covered by at least one route. 

• If a facility does not happen to be a vertex in $G$, it is added as a new vertex if it
attracts any route, e.g., $f_1$, $f_2$, and $f_3$ in Figure 2. Facility $f_4$ no longer exists in
the augmented graph because it does not attract any routes.

• In order to accommodate these new vertices, some edges in $G$ are replaced with
"smaller" edges in $G'$. For example, in Figure 2, the edge $e_{2,3}$ is replaced with
the following edges: $e_{v_2,A}$, $e_{A,H}$, $e_{H,f_1}$, $e_{f_1,D}$, and $e_{D,v_3}$.

When the road network is augmented, every vertex in G' records the identifiers of 
the routes that cover this location. Figure 2 also shows the identifiers of the routes that
are recorded at each vertex in the augmented graph. 

Algorithm 2: AUG(G, R, F, δ, M)

 1  E_F ← the set of edges where F are located;
 2  E_R ← the set of edges that R intersect;
 3  G' ← Augment(G, F, R);
 4  foreach e ∈ (E_F ∪ E_R) do
 5      foreach v_i ∈ e.F_c ∪ e.O_c do
 6          v_i.R_c ← getCoverRouteIds(v_i, e.R_c);
 7  S ← ∅; optS ← 0;
 8  foreach e = (v_s, v_e) ∈ G'.E' do
 9      R' ← v_s.R_c ∩ v_e.R_c;
10      score_M(e) ← compute the score of e based on M;
11      if score_M(e) > optS then
12          optS ← score_M(e);
13          S ← {e};
14      else if score_M(e) = optS then S ← S.add(e);
15  foreach s ∈ S do
16      map s to G;
17      if canExtend(s, G) then extend(s, G);
18  return S;



In Figure 2, $r_2$ has score 12, and is attracted by 2 facilities. Thus, each edge in $G'$
that is covered by $r_2$ should receive a score $\frac{score(r_2)}{3} = 4$. Route $r_1$ has score 8, and is
attracted by one facility. Each edge covered by it in $G'$ receives a score $\frac{score(r_1)}{2} = 4$.
Route $r_3$ has score 6, and is attracted by 1 facility. Therefore, each covered edge
receives score $\frac{score(r_3)}{2} = 3$.

For each edge in the augmented road network, the algorithm takes an intersec- 
tion of the route identifiers of its two vertices, and computes its score. For instance, 
$score(e_{v_2,A}) = 4 + 3 = 7$ and $score(e_{A,H}) = 4 + 4 + 3 = 11$. The scores of other edges
can be computed in a similar way. 

After that, AUG identifies the optimal edge(s) with the highest score. Since AH 
has the highest score in G', the optimal edge is (A, H). It is mapped back, and becomes 
the segment AH. As AUG cannot extend it to a longer segment, AH is returned as the 
result. 
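
The scoring loop of AUG (lines 8-14 of Algorithm 2) can be sketched on a small part of the augmented graph. The sketch assumes the equal-weight model, so that each route contributes a fixed share to every edge it covers; the function name `optimal_edges` and the dictionary layout are assumptions made for this illustration, and the attraction list of $f_1$ is taken from $f_1.R_c$.

```python
def optimal_edges(edges, attraction_list, per_edge_share):
    """Return (optS, list of edges of G' whose score equals optS).

    edges:           iterable of (u, v) vertex pairs of the augmented graph
    attraction_list: vertex -> set of route ids covering that vertex
    per_edge_share:  route id -> score the route gives to each covered edge
    """
    best, result = 0, []
    for u, v in edges:
        # A route covers the edge iff it covers both end vertices.
        common = attraction_list[u] & attraction_list[v]
        s = sum(per_edge_share[r] for r in common)
        if s > best:
            best, result = s, [(u, v)]
        elif s == best:
            result.append((u, v))
    return best, result


# Part of the running example: three of the edges that replace e_{2,3}.
attraction_list = {
    "v2": {"r2", "r3"},
    "A": {"r1", "r2", "r3"},
    "H": {"r1", "r2", "r3"},
    "f1": {"r1", "r2"},
}
per_edge_share = {"r1": 4, "r2": 4, "r3": 3}  # 8/2, 12/3, 6/2
edges = [("v2", "A"), ("A", "H"), ("H", "f1")]

print(optimal_edges(edges, attraction_list, per_edge_share))  # (11, [('A', 'H')])
```

Mapping the returned edges back to the original graph and merging adjacent optimal edges into maximal segments is left out of the sketch.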

4.3 Analysis 

We analyze the time complexity of the AUG algorithm, and show its completeness and 
correctness. 

Theorem 1 The AUG algorithm has time complexity $O((|E| + |F| + |R|)|R| + |S|)$.

Proof 2 In AUG, the graph augmentation takes 0(\F\ + 2\R\) (line 4), because each 
route contributes exactly two vertices. 
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In the first loop (lines 5-9), the worst case is that facilities and routes are evenly 
distributed to the road network so that every edge in G is augmented. In this case, for 
an edge e, e.F c + e.O c = ^-jgjp^ - Therefore, the outer loop takes \E F \ + \E R \ < \E\ 

and \L\ = 0( |F| ^ | |fl| ). So this loop takes time 0(\F\ + 2\R\). 

In the second loop (line 11-18), \E'\ < \E\ + \F\ + 2\R\. In line 15, \R'\ < \R\. 
Therefore, this loop takes time 0((\E\ + \F\ + 2|i?|)|i?|). The third loop takes time 
complexity \S\. Note that \S\ is usually very small. 

Tosumma rize, the time complexity of A UG is 0(\F\+2\R\ + {\E\ + \F\+2\R\)\R\ + 
\S\). After simplification, the time complexity is 0((\E\ + \F\ + \R\)\R\ + \S\). 
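
The simplification can be spelled out as follows, under the mild added assumption that $|R| \ge 1$ (otherwise there are no routes to score):

```latex
\begin{align*}
|F| + 2|R| + (|E| + |F| + 2|R|)\,|R| + |S|
  &\le (|F| + 2|R|)\,|R| + (|E| + |F| + 2|R|)\,|R| + |S| \\
  &\le 2\,(|E| + |F| + 2|R|)\,|R| + |S| \\
  &\le 4\,(|E| + |F| + |R|)\,|R| + |S| \\
  &\in O\big((|E| + |F| + |R|)\,|R| + |S|\big).
\end{align*}
```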

We proceed to show the correctness and completeness of AUG. 

Theorem 2 A segment output by the AUG algorithm is an optimal segment. 

Proof Sketch. In AUG, every edge in the augmented graph is checked to find
the value for optS. AUG then adds a segment iff the segment has a score equal to
optS. This implies that any segment in the result set S must be optimal.

Theorem 3 The AUG algorithm finds every optimal segment in the graph. 

Proof Sketch. This theorem can be proved with the following two points. First, 
AUG searches the graph to make sure that every edge is scanned. Second, if an edge 
has a score equal to optS, it either is appended to an existing segment in S or is added 
to S as a new segment that might be extended later. Therefore, no segment with score 
equal to optS is missed. 

5 Iterative Partitioning 

5.1 Overview 

Although the augmentation approach is effective at finding the optimal segments, we 
can improve its efficiency by pruning unpromising segments. 

The idea of the ITE algorithm is to quickly identify a subsegment of an optimal 
segment (optimal subsegment) and then extend the optimal subsegment into an entire 
optimal segment. Therefore, ITE organizes the segments using a heap such that those 
segments that are most likely to contain an optimal subsegment get examined first. 
If the segment under examination is an optimal subsegment then the entire optimal 
segment can be found by extending it. In addition, the optimal score can be calculated 
easily. Otherwise, the segment is partitioned into smaller segments, whose likelihoods 
of having an optimal subsegment are also calculated, upon which they are inserted back 
into the heap. 

Given a segment s, we use the scores of the intersecting routes to measure its like- 
lihood of having an optimal subsegment. The segment containing the optimal subseg- 
ment is likely to have many intersecting routes, from which it is likely to receive a high 
score. 






For example, in Figure 1, initially the edges that intersect any route are inserted
into the heap. The edge $v_1v_3$ has the most intersecting routes and so is likely to contain
an optimal segment. So this edge is partitioned into equal-sized, smaller segments. ITE
calculates the intersecting routes for each of them, and adds them to the heap. This
process continues until a subsegment of an optimal segment is found. In this case, a
subsegment s of AH is found. Then s is extended to find that AH is the entire optimal
segment.

Both AUG and ITE partition the edges of the network graph into smaller pieces. 
The main difference between ITE and AUG lies in how a subsegment of the optimal 
segment is found. In AUG, the partitioning of edges in the network graph is unguided. 
Every edge that has an attracting facility or a route end point is partitioned. In ITE, 
the partitioning of edges is guided by the likelihoods of the edges to have an optimal 
subsegment. 

5.2 The ITE Algorithm 

Recall that we are interested in finding those segments that contain an optimal subsegment.
Before presenting the ITE algorithm, we need definitions that relate the score of
a segment to the scores of its network points, as defined in Section 2.4.

Definition 6 Given a road segment s, we define its min score s.min and max score
s.max as follows.

$$s.min = \min_{p \in s} score_M(p)$$
$$s.max = \max_{p \in s} score_M(p)$$

By definition, an optimal segment $s_{opt}$ has $s_{opt}.min = s_{opt}.max$.
Next, we define lower and upper bound scores of a segment s in order to only
process those segments that may contain an optimal location.

Definition 7 Given a segment s and a score distribution model M, let s.I and s.C
be defined as in Section 2.2, and let s.lb and s.ub denote the lower and upper bound
scores of s. We define:

$$s.lb = \sum_{r_i \in s.C} w_M(j_i, k_i)\, score(r_i)$$
$$s.ub = \sum_{r_i \in s.I} w_M(j_i, k_i)\, score(r_i)$$

where s is the $j_i$-th segment of $r_i$ with $k_i$ attracting facilities, and $w_M(j_i, k_i)$ computes
the fraction of $r_i$'s score to be assigned to s based on M.

If a segment has facilities located on it, the segment has subsegments that may be
assigned different scores based on the score distribution model. In this case, the lower
bound score of the segment still takes the smallest score value assigned to the
subsegments, while the upper bound score takes the largest score value.

Lemma 2 Let M be a score distribution model where a route can only distribute non-negative
scores. Let the min and max scores and the lower and upper bound scores of
s be defined as above. Given a road segment s, we have $s.lb \le s.min$ and $s.ub \ge s.max$.

Proof 3 Suppose a network location $p_1 \in s$ s.t. $score_M(p_1) = s.min$. Since each
route $r \in s.C$ contains s, we have $p_1 \in r$. So $p_1$ at least gains the scores distributed
by the routes in s.C. Then $score_M(p_1) \ge \sum_{r_i \in s.C} w_M(j_i, k_i)\, score(r_i)$.

Let $p_2 \in s$ be a location s.t. $score_M(p_2) = s.max$. We show that the set of routes
that contribute scores to $p_2$ is a subset of s.I. The set of routes that contribute scores to
$p_2$ consists of two sets, the set of routes that contain s (s.C) and the set of routes that
cover $p_2$ (s.I'). Each route in s.I' must also intersect s, so $s.I' \subseteq s.I$. Since $s.C \subseteq s.I$,
we have $(s.C \cup s.I') \subseteq s.I$. That is, $score_M(p_2) \le \sum_{r_i \in s.I} w_M(j_i, k_i)\, score(r_i)$.

Recall that Algorithm PreProcess builds a mapping from each edge e to its intersecting
routes $e.R_c$. We then compute the upper and lower bound scores for a segment
$s \subseteq e$ by retrieving its s.I and s.C from $e.R_c$. The algorithm can use the bounds to
prune the segments that cannot contain an optimal subsegment.

Lemma 3 Given two segments $s_1$ and $s_2$, if $s_1.lb > s_2.ub$, then $s_2$ does not contain
an optimal subsegment.

Proof 4 We prove Lemma 3 by showing that $s_2$ cannot contain any optimal location.
Assume two points $p_1 \in s_1$ and $p_2 \in s_2$. We have $score_M(p_1) \ge s_1.lb > s_2.ub \ge score_M(p_2)$.
So $p_2$ cannot be an optimal location.

With Lemma 3, segments that do not contain an optimal subsegment can be pruned.
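
As a small illustration of the pruning rule in Lemma 3, the following Python sketch keeps only the segments whose upper bound is not below the largest lower bound. The segment names and bound values are hypothetical and serve only to show the comparison.

```python
def prune_by_bounds(segments):
    """segments: name -> (lb, ub). Drop every segment whose ub is below the best lb."""
    max_lb = max(lb for lb, _ in segments.values())
    return {name: (lb, ub) for name, (lb, ub) in segments.items() if ub >= max_lb}

# Hypothetical candidates: "c" is dominated because its ub (6) is below the lb of "a" (7).
candidates = {"a": (7, 11), "b": (4, 8), "c": (0, 6)}
kept = prune_by_bounds(candidates)
```

Segment "b" survives even though its lower bound is small, because its upper bound still reaches the best lower bound seen so far.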

The second strategy employed in ITE is to prune the segments that eventually lead 
to the same optimal segment. These segments should be detected and pruned early to 
avoid partitioning them further and making unnecessary calculations. 

Lemma 4 Given two segments $s_1$ and $s_2$, if $s_2.I \subseteq s_1.C$ and $s_2$ contains an optimal
subsegment of an optimal segment, then $s_1$ also contains an optimal subsegment of the
same optimal segment.

Proof 5 Let the optimal segment be $s_{opt}$, and let $s_2$ contain an optimal subsegment
of $s_{opt}$. Then we have $s_{opt}.C \subseteq s_2.I$ because every route that contains the optimal
subsegment must intersect $s_2$.

Since $s_2.I \subseteq s_1.C$, we have $s_{opt}.C \subseteq s_1.C$. By the definitions of segment score
and optimality, we also have $s_{opt}.C = s_1.C$. Therefore, by the definition of segment
score, $s_1$ is also an optimal subsegment of $s_{opt}$.

Once the result set is not empty, Lemma 4 allows us to prune segments that lead to
the same optimal segment. We study the effectiveness of the pruning strategies in the
experimental evaluation.

Figure 3 shows the edge ei^ from Figure 1. It illustrates the calculation of the segment
score upper bound and lower bound. This edge has a facility $f_1$ built on it, resulting
in two subsegments, $s_1$ from the beginning to $f_1$ and $s_2$ from $f_1$ to the end. The routes
$r_1$ and $r_3$ are attracted by one facility, whereas $r_2$ is attracted by two facilities. We
show how to calculate the score upper and lower bounds for both $s_1$ and $s_2$. Segment
$s_1$ is intersected by $r_1$ and $r_3$, and contained by $r_2$. Therefore, $s_1.lb = \frac{1}{3} score(r_2)$ and
$s_1.ub = \frac{1}{2} score(r_1) + \frac{1}{3} score(r_2) + \frac{1}{2} score(r_3)$.

Figure 3: Segment Upper and Lower Bound

Similarly, $s_2$ is intersected by $r_1$, and contained by $r_2$. Therefore, $s_2.lb = \frac{1}{3} score(r_2)$ and
$s_2.ub = \frac{1}{2} score(r_1) + \frac{1}{3} score(r_2)$.
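
The bound computation for the two subsegments can be sketched as follows. The sketch assumes that the route scores are those of the earlier example (8, 12, and 6) and that the weight function is the equal share 1/(k + 1); both are assumptions made for illustration, and `weight` is a hypothetical stand-in for $w_M$.

```python
from fractions import Fraction

# Route id -> (score, number of attracting facilities); scores assumed from the earlier example.
ROUTES = {"r1": (8, 1), "r2": (12, 2), "r3": (6, 1)}

def weight(k):
    """Assumed stand-in for w_M: an equal share 1 / (k + 1)."""
    return Fraction(1, k + 1)

def bounds(intersecting, containing):
    """Return (lb, ub) as in Definition 7 from the sets s.I and s.C."""
    def total(ids):
        return sum(weight(ROUTES[r][1]) * ROUTES[r][0] for r in ids)
    return total(containing), total(intersecting)

# s.I includes the containing route r2, since every containing route also intersects s.
s1 = bounds({"r1", "r2", "r3"}, {"r2"})
s2 = bounds({"r1", "r2"}, {"r2"})
```

Under these assumptions, both subsegments share the same lower bound, and only the upper bounds differ because $r_3$ does not reach the second subsegment.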

Algorithm 3 shows the pseudo-code of the ITE algorithm. This algorithm uses a
priority queue Q that is sorted on the upper bound score of every segment. A variable
called maxLb is used to keep track of the maximum lower bound score seen so far.

First, ITE initializes the edges such that each has a lower bound score and an
upper bound score computed as in Definition 7 (line 1). It also initializes the result
set S, enqueues the edges G.E of the road network graph, and initializes variable
maxLb (line 2). It then enters the loop and pops the top element from Q (lines 3-4).
The flag variable split, indicating whether or not the current segment needs to be
partitioned, is set to false at the beginning of each iteration (line 5). Next, if the upper
bound score of currSeg exceeds maxLb, then it needs to be further partitioned, so split
is set to true (lines 6-7). If the upper bound score of currSeg is equal to maxLb, we
have found a result segment if the upper and lower bound scores are the same. In that
case, currSeg is added to the result set (lines 8-10). However, if the upper and lower bound
scores differ, ITE tests whether currSeg might lead to an optimal subsegment of a new
optimal segment that has not been seen before. To do so, ITE checks whether there is a
result s in S such that currSeg.I is a subset of s.C (see Lemma 4). If not, split is set to
true (lines 11-12).

If split is true, ITE partitions currSeg into subsegments with the procedure
SplitSegment, which partitions a segment into $\beta$ equal-length subsegments
(lines 13-14). Here $\beta$ is a tunable parameter. In the experimental studies, we show the
effect of $\beta$.

The intersection set, contain set, and lower and upper bound scores for each segment
output by SplitSegment are computed (lines 15-22). Next,
maxLb is updated if the subsegment has a higher lower bound score (lines 23-24).
Then these subsegments are added back to Q (line 25).

Upon exiting the loop, each optimal segment is extended to its full length by overlapping
the routes that contribute scores to the segment (lines 26-27).

We continue to use Figure 1 to illustrate the execution of Algorithm 3. We show
the iterative partitioning of e 2 ^ in Figure 4. Table 2 shows the top entries of the queue
obtained from this partitioning, together with their upper and lower bound scores during
the execution of ITE. Double lines separate iterations. The segment at the top of
the queue is in bold.

Algorithm 3: ITE(G, R, F, δ, β, M)

 1  Init. e ∈ G.E s.t. e.lb ← 0 and e.ub ← Σ_{r_i ∈ e.R_c} w_M(j_i, k_i) score(r_i);
 2  S ← ∅; Q.enqueue(G.E); maxLb ← 0;
 3  while Q ≠ ∅ do
 4      currSeg ← Q.dequeue;
 5      split ← false;
 6      if currSeg.ub > maxLb then
 7          split ← true;
 8      else if currSeg.ub = maxLb then
 9          if currSeg.ub = currSeg.lb then
10              S.add(currSeg);
11          else if ∄ s ∈ S such that currSeg.I ⊆ s.C then
12              split ← true;
13      if split then
14          ss ← SplitSegment(currSeg, β);
15          foreach s ∈ ss do
16              foreach r ∈ currSeg.I do
17                  if intersects(r, s) then
18                      s.I ← s.I.add(r);
19                      s.ub ← s.ub + w_M(j, k) score(r);
20                  if contains(r, s) then
21                      s.C ← s.C.add(r);
22                      s.lb ← s.lb + w_M(j, k) score(r);
23              if s.lb > maxLb then
24                  maxLb ← s.lb;
25              Q.enqueue(s);
26  foreach s ∈ S do
27      Find the entire optimal segment of s by overlapping the route usage objects
        r ∈ s.C one by one.



Below, we calculate the upper and lower bound scores of $\overline{v_2p_1}$, which is intersected
by $r_1$, $r_2$, and $r_3$, and contained by $r_2$ and $r_3$. Therefore,
$ub(\overline{v_2p_1}) = \sum_{i=1}^{3} \frac{score(r_i)}{k_i + 1} = \frac{8}{2} + \frac{12}{3} + \frac{6}{2} = 11$ and
$lb(\overline{v_2p_1}) = \sum_{i=2}^{3} \frac{score(r_i)}{k_i + 1} = \frac{12}{3} + \frac{6}{2} = 7$.
The upper and lower bound scores of other segments can be computed in a similar way.

Since segment $\overline{v_2p_1}$ has the largest upper bound score and its upper bound is not
the same as the lower bound, it is split as shown in Figure 4(b). The upper and lower
bound scores of the subsegments are also computed. Now maxLb = 11.

In the next iteration, segment $\overline{p_1p_2}$ has the largest upper bound score. But still, its
upper bound is not the same as its lower bound. It is then split (Figure 4(c)). The upper
and lower bound scores of the subsegments are also computed, and maxLb = 11.

The next segment under examination is pEpE, which is split again because its upper
bound is different from its lower bound (Figure 4(d)). Still, maxLb = 11.

Figure 4: ITE Execution Example

Table 2: ITE Execution Example

The next segment under examination is pepi, whose upper and lower bounds are
the same. Its upper bound also equals maxLb. Therefore, it is added
to the result set as an optimal subsegment.

The process continues until an optimal subsegment of every optimal segment in the 
network graph is found. Q is updated at the end of each iteration. Note that ITE does 
not need to examine those segments with score upper bound less than maxLb (11 in 
this case), resulting in a substantial reduction of the search space. 

In the end, the entire optimal segment AH can be found by overlapping the routes
in AH.C, namely $r_1$, $r_2$, and $r_3$.
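
The loop of Algorithm 3 can be mimicked on a single edge with a short Python sketch. It is a simplification under stated assumptions, not the authors' code: the edge is parameterised as the interval [0, 1], each route is a hypothetical (start, end, share) interval whose share is already the per-segment score, and a `min_len` guard (not part of Algorithm 3) stops the splitting of very short segments. The final step relies on the observation that, on a line, the full optimal segment is the overlap of the routes in s.C.

```python
import heapq
from fractions import Fraction

def bounds(lo, hi, candidates, routes):
    """Return (I, C, lb, ub) for segment [lo, hi] among the candidate route ids."""
    inter = {r for r in candidates if routes[r][0] < hi and routes[r][1] > lo}
    cont = {r for r in inter if routes[r][0] <= lo and routes[r][1] >= hi}
    lb = sum(routes[r][2] for r in cont)
    ub = sum(routes[r][2] for r in inter)
    return inter, cont, lb, ub

def ite_1d(routes, beta=2, min_len=Fraction(1, 1024)):
    """routes: id -> (start, end, share) on one edge parameterised as [0, 1]."""
    inter, cont, lb, ub = bounds(Fraction(0), Fraction(1), set(routes), routes)
    # Max-heap on ub (negated); the unique counter breaks ties before any set is compared.
    heap = [(-ub, 0, Fraction(0), Fraction(1), lb, inter, cont)]
    counter, max_lb, results = 1, 0, []
    while heap:
        neg_ub, _, lo, hi, lb, inter, cont = heapq.heappop(heap)
        ub = -neg_ub
        split = False
        if ub > max_lb:
            split = True
        elif ub == max_lb:
            if ub == lb:
                results.append(cont)                    # optimal subsegment found
            elif not any(inter <= c for c in results):  # Lemma 4 check
                split = True
        if split and hi - lo > min_len:                 # min_len guard is an added assumption
            step = (hi - lo) / beta
            for i in range(beta):
                a, b = lo + i * step, lo + (i + 1) * step
                s_i, s_c, s_lb, s_ub = bounds(a, b, inter, routes)
                max_lb = max(max_lb, s_lb)
                heapq.heappush(heap, (-s_ub, counter, a, b, s_lb, s_i, s_c))
                counter += 1
    # Extension step: overlap the routes in s.C to recover the full segment.
    segments = {(max(routes[r][0] for r in c), min(routes[r][1] for r in c))
                for c in results if c}
    return max_lb, segments

# Hypothetical routes with shares 4, 4, and 3, overlapping on [1/4, 3/4].
example = {
    "r1": (Fraction(1, 4), Fraction(1), 4),
    "r2": (Fraction(0), Fraction(3, 4), 4),
    "r3": (Fraction(0), Fraction(3, 4), 3),
}
```

On this input the queue yields two optimal subsegments, [1/4, 1/2] and [1/2, 3/4], which the extension step merges into the single segment [1/4, 3/4]. Segments whose upper bound falls below maxLb are dequeued and dropped without further work, which is the pruning of Lemma 3.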

5.3 Analysis 

We consider the correctness and completeness of ITE and analyze its time complexity. 

The correctness of ITE depends on finding the subsegments with the maximum
score correctly. Here, we prove that the algorithm terminates and returns subsegments
that have the maximum score. We must show that after a finite number of iterations,
ITE produces a subsegment s such that s.ub = s.lb, where s.ub is the maximum score
among all the subsegments. First, we know that when s.ub = s.lb, s is a consistent
segment with the score s.ub. Since the priority queue ensures that s.ub is the maximum
among all the subsegments, s is a subsegment with the maximum
score. Since ITE always examines the subsegment with the maximal s.ub, we only
need to show that ITE terminates. This can be shown by the following properties:
(1) the maximum ub value decreases, (2) the maximum lb value increases, and (3) the
maximum ub and lb values converge to the same value after a number of iterations.

Next, to prove completeness, we show that for each optimal segment, ITE is able
to find a subsegment with the maximum score that is contained within the optimal
segment. Let $s_i$ and $s_j$ be subsegments of two distinct optimal segments. Without
loss of generality, suppose ITE has found $s_i$. We show that ITE also finds $s_j$ instead
of pruning it. Recall that ITE uses two pruning criteria to prune a subsegment. The
first criterion says that $s_j$ can be pruned if $s_i.lb > s_j.ub$. Since both $s_i$ and $s_j$ are
subsegments of optimal segments with the same optimal score OPT, we have
$s_j.ub \ge OPT \ge s_i.lb$. Therefore, this pruning criterion does not apply. The second criterion
states that $s_j$ can be pruned if $s_j.I \subseteq s_i.C$. Since $s_i$ and $s_j$ are subsegments of different
optimal segments, $s_j.C \not\subseteq s_i.C$. We also know that $s_j$ is a subsegment with the
maximum score, hence $s_j.I = s_j.C$. Putting them together, we have $s_j.I = s_j.C \not\subseteq s_i.C$.
Hence the second pruning criterion also does not apply. Thus, ITE does not prune
$s_j$, but detects it as a part of an entire segment which is also found. Therefore, ITE will
find all the optimal segments.

Theorem 4 The time complexity of ITE is $O((\log|R| + |S|)|R|)$.

Proof 6 In the priority queue operations, ITE iteratively splits the segment with the
maximal score upper bound. The number of splits corresponds to the height of the tree
with fan-out $\beta$. If $\beta = 4$, we get a quadtree. According to [11], the asymptotic height of
the quadtree is $\log|R|$. For each subsegment s, ITE uses a loop to find its intersection
set s.I and contain set s.C. We have $s.I \subseteq R$, so the time complexity of the loop is
$O(|R|)$. Thus, the time complexity of the while loop is $O(|R|\log|R|)$.

The second loop depends on the size of the result set S. Since $s.C \subseteq R$, the time
complexity of this loop is $O(|R||S|)$.

In total, the time complexity of ITE is $O((\log|R| + |S|)|R|)$.

6 Experimental Study 

This section reports on empirical studies that aim to elicit design properties of the 
proposed framework and, in particular, of the AUG and ITE algorithms. The studies 
use a real spatial network and real facility and trajectory data, as well as synthetic data. 

The experiments covered in this section were performed on an Intel Xeon (2.66 GHz)
quad-core machine with 8 GB of main memory running Linux (kernel version 2.6.18).
Both algorithms were implemented in Java. Every instantiation of the JVM was
allocated 2 GB of virtual memory. We first describe the data used in the experiments as
well as the parameter settings. Then we cover experiments that target different aspects
of the algorithms.

6.1 Data Sets and Parameter Settings 

6.1.1 Road Network 

The digital road network TOP10DK¹ was used for our experiments. It covers all of
Denmark at a fine granularity.

To construct the road network graph, we first identify the vertices. An edge exists
between two vertices $v_1$ and $v_2$ as long as there exists a road segment connecting $v_1$
and $v_2$. In total, the graph contains 465,057 vertices and 920,218 edges.

In order to study the performance of the algorithms thoroughly, we used a real- 
world data set and a synthetic data set. Each data set contains a collection of routes 
(GPS recordings received from drivers) and a collection of facilities. Both data sets 
share the same underlying road network. 

6.1.2 Route Data Preparation 

We obtained the real route traversal data set from the "Pay as You Speed" project ifTTl Fi 
The data set is obtained from vehicles driving in North Jutland, Denmark. It
contains 39,688,695 GPS points produced by 151 different drivers in the period from
October 1, 2007 to January 31, 2008. In this data set, each route is represented by a
sequence of GPS points that may deviate from the underlying road network. To solve
this problem, we use an existing technique by Tradisauskas et al. [21] to map-match the
route data onto the underlying road network. The sequence of traversed edges must
also be determined because two consecutive GPS points may be matched to different
edges. To achieve this, we use the bidirectional Dijkstra's algorithm provided by Pohl [20].

In addition, stationary points, when reported GPS locations are the same for con- 
secutive time points for the same user, are removed. Further, different trips of users 
were identified from the set of GPS recordings. We distinguish a new route when the 
time period between two consecutive GPS points is more than 3 minutes. In total, we 
obtain 51,146 routes. The median number of GPS points of the routes is 488. The 
median length of the routes in the real data sets is 6524.81. 
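
The two cleaning steps described above (dropping stationary points and starting a new route after a gap of more than 3 minutes) can be sketched as follows. This is an illustrative sketch only: the tuple layout `(timestamp_s, x, y)` and the function name are assumptions, and map matching is left out.

```python
def split_into_routes(points, max_gap_s=180):
    """points: time-ordered (timestamp_s, x, y) tuples of one driver."""
    routes, current = [], []
    for p in points:
        if current and p[1:] == current[-1][1:]:
            continue  # stationary: same location as the previously kept point
        if current and p[0] - current[-1][0] > max_gap_s:
            routes.append(current)  # gap of more than 3 minutes: close the route
            current = []
        current.append(p)
    if current:
        routes.append(current)
    return routes

# Hypothetical recording: one repeated location and one long gap.
sample = [(0, 0, 0), (10, 1, 0), (20, 1, 0), (30, 2, 0), (400, 3, 0), (410, 4, 0)]
trips = split_into_routes(sample)
```

Here the point at time 20 is dropped as stationary, and the 370-second gap before time 400 starts a second route.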

We generate synthetic routes by simulating the movement of a vehicle that emits
GPS points with a fixed frequency (e.g., 0.1 Hz). The length of each route is thus
the speed of the car times the number of GPS points it emits. In the simulation, routes
are allowed to have variable lengths. So when starting a new synthetic route, we first
generate a random number between 480 and 520 for the number of GPS points. We use
480 and 520 because the median number of GPS points of the routes in the real data
set is 488. Then we randomly select a network point to start a new route. When taking
the next point, we follow the graph and traverse to the next edge (randomly pick one
if more than one outgoing edge exists). The sampling frequency is fixed for one data
set to simulate a real life application. We then vary the sampling frequency, resulting
in three different data sets, i.e., short, medium, and long, with the median lengths of
routes being 3405.76, 8030.42, and 12890.12, respectively.

1 http://tinyurl.com/bqtgh2g

2 http://www.trafikdage.dk/td/papers/papers07/tdpaper27.pdf

In both the real and synthetic data sets, for each route, we use a random number 
generator to generate the user count and route usage randomly from 1 to 20. 
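
The synthetic route generation described above can be approximated by a random walk. The sketch below is a deliberate simplification: each step moves to a neighbouring vertex and stands in for one emitted GPS point, whereas the simulation in the text emits points along edges at a fixed frequency. The adjacency-list format and all names are assumptions made for illustration.

```python
import random

def synthetic_route(adjacency, rng, min_points=480, max_points=520):
    """Random walk on the graph; returns (route, user_count, route_usage)."""
    n_points = rng.randint(min_points, max_points)  # number of GPS points
    current = rng.choice(sorted(adjacency))         # random start vertex
    route = [current]
    while len(route) < n_points and adjacency[current]:
        current = rng.choice(adjacency[current])    # pick one outgoing edge
        route.append(current)
    user_count = rng.randint(1, 20)
    route_usage = rng.randint(1, 20)
    return route, user_count, route_usage

# Hypothetical ring network with 10 vertices.
ring = {i: [(i + 1) % 10, (i - 1) % 10] for i in range(10)}
route, user_count, route_usage = synthetic_route(ring, random.Random(0))
```

Passing an explicit `random.Random` instance keeps a generated data set reproducible across runs.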

6.1.3 Facilities 

The facility data set contains 16,577 places of interest located throughout Denmark.
The exact address of each facility can be looked up in the yellow pages. Since it is
not meaningful to mix businesses of different types, we group the facilities according
to their types (e.g., fast food, salon, supermarket). In all the experiments below, the
facilities are of the same type. When generating the synthetic facility data set, we
randomly pick network points from the network. Every facility in either the real data
set or the synthetic data set attracts at least one route.

Statistics on the data sets and the settings for key parameters are summarized in
Table 3. The default values are in bold.

6.1.4 Scoring Function and Score Distribution Model 

We observe from the experiments that the scoring function and the score distribution 
model do not affect the performance of the two algorithms. Therefore, we only show 
the experimental results produced when using the first scoring function and the first 
proposed model. 



Parameter                       Range
$\delta$                        0.02, 0.04, 0.06, ..., 0.12
$\beta$                         2, 3, 4, 5, 6
Num Routes in Real Data         5k, 10k, ..., 25k
Num Routes in Synthetic Data    10k, 15k, ..., 30k
Num Facilities                  600, 800, 1k, ..., 1.4k

Table 3: Experimental Settings



6.2 Effect of $\delta$

Recall that a facility attracts a route if their distance is no further than $\delta$. Figure 5 shows
the performance and optimal scores when varying $\delta$ on real data.

Although the running times for both algorithms increase when $\delta$ increases (Figure 5(a)),
the two algorithms exhibit different patterns. When $\delta$ increases from 0.02
to 0.1, the running time of AUG increases much faster than that of ITE. AUG has to explore
further on the edges to find the attracting facilities for each route traversal, in order to
decide whether to include them in the augmented graph. This may be the reason why the
running time of AUG increases more rapidly than that of ITE. When $\delta$ increases from 0.1
to 0.12, the running time of AUG increases more slowly. The reason may be that fewer
facilities are taken into account. In reality, some facilities prefer locations near the
junctions, so the density of facilities in the middle of roads might be lower. The increase
of the running time of ITE is smaller, and there is no sudden change, indicating that
$\delta$ has little effect on ITE.

Figure 5: Effect of $\delta$

Since the optimal scores output by both algorithms are the same, we plot one figure
to show the effect of $\delta$ (Figure 5(b)). The optimal scores decrease like a staircase. The
reason is that increasing $\delta$ may increase the number of attracting facilities for a route,
resulting in decreased scores of segments received from the routes according to the
score distribution model.



6.3 Effect of $\beta$

Recall that $\beta$ is the number of subsegments produced when a segment is partitioned.
It is a user-specified parameter. Figure 6 shows the effect of $\beta$ on the running time of
ITE.

Figure 6: Effect of $\beta$, Real

Initially, as the $\beta$ value increases, the running time decreases. However, beyond a
certain $\beta$ value (4 in the figure), a further increase causes the running
time to increase. The best performance of ITE occurs when $\beta = 4$. When the
$\beta$ value is smaller than 4, the "zooming-into" an optimal subsegment may not be as
fast as when $\beta = 4$. On the other hand, when the $\beta$ value is greater than 4, computing
the lower and upper bounds of the subsegments can take substantial time, which explains the
increase in running time.

6.4 Effect of the Number of Routes

Figure 7 shows the performance when varying the number of routes using real and
synthetic data. Algorithms AUG and ITE perform equally well for a small number
(5k) of routes. But the running time of AUG grows much more rapidly than that of ITE
as the number of routes increases. This is expected from the time complexity
analysis of AUG and ITE.

Figure 7: Effect of the Number of Routes on Performance



6.5 Effect of Route Length 

In this set of experiments, we study the effect of the length of routes on the performance
of both algorithms. The three data sets used in the experiments are explained
above. Figure 8 shows the results. For both algorithms, more time is needed for longer
routes when the number of route traversals ranges from 5k to 25k. Again, the running
time of AUG increases faster than that of ITE.

Figure 8: Effect of Route Length

For AUG, computing the attraction list for the vertices takes longer as each
route covers more vertices on average. For ITE, each edge intersects more routes on
average. So after splitting a segment, more routes have to be examined to calculate the
lower and upper bound scores of the subsegments, resulting in longer running time.

6.6 Effect of the Number of Facilities

Figure 9 shows the running time of AUG and ITE when varying the number of facilities.

Figure 9: Effect of the Number of Facilities

For both kinds of routes, AUG is affected slightly more by the increase in facilities.
The reason is that in AUG, facilities have to be augmented, and then attraction lists have
to be calculated for them, resulting in substantial computation. In contrast, facilities
cause little computation in ITE. When ITE partitions the segments, no housekeeping
is necessary for facilities. It just needs to adjust the relative positions of the facilities
according to the newly produced segments.



6.7 Effectiveness of Pruning Strategies 

In this set of experiments, we study the effectiveness of the pruning strategies in ITE
by keeping track of the number of segments generated, partitioned, and pruned in the
course of finding the optimal segments. Figure 10 shows the numbers of segments generated,
split, and pruned by Lemma 3 and Lemma 4 when running ITE with default
settings. Label "total" means the total number of generated subsegments, "splits" is
the number of subsegments that need further splitting, "prune1" is the number of subsegments
that are pruned using Lemma 3, and "prune2" is the number of subsegments
that are pruned using Lemma 4.



Figure 10: Effect of Pruning Strategies. Panels: (a) Real, (b) Synthetic; categories: total, splits, prune1, prune2.

It is observed that in the real data set the number of segments that require further
splitting is 4,273, which is approximately 25% of the total number of segments,
whereas the number is between 5% and 10% in the synthetic data set.

In the real data set, almost 60% of the generated segments are pruned by Lemma 3
because they cannot contain an optimal segment. In contrast, Lemma 4 prunes 2.7% of
the total segments.

In the synthetic data set, where route traversals are generated more evenly throughout
the entire map, Lemma 3 prunes almost 10% of the total generated segments.
Lemma 4 prunes around 1% of the total segments.

7 Related Work 

The paper's study relates to two previously studied problems, the facility location prob- 
lem (FLP) and flow intercepting facility location problem (FIFLP), which we cover in 
turn. 

7.1 Facility Location Problem 

The classical facility location problem Q[10] QUEUED takes as input a finite set C 
of customer locations and a finite set P of candidate facility locations, and it returns k
(k > 0) facilities in P that optimize a predefined metric.

The single facility location problem [10, 19] finds one location in P that optimizes
a predefined metric with respect to a set C of customer locations. It assumes that no 
facility has been built previously; in contrast, our optimal segment problem permits the 
presence of a set F of existing facilities. 

The online facility location problem [12, 18] assumes a dynamic setting, where (i) 
the set C of customers is initially empty, and (ii) new customers may be inserted into 
C as time evolves. The solution to this problem constructs facilities one at a time, such 
that its quality (with respect to some metric) is competitive in comparison to solutions 
that are given all customer points in advance. This problem assumes that the set P of 
candidate facility locations is finite, while our optimal segment problem does not. 

Many works [7, 9, 22-25] study another variant of the facility location problem,
the so-called optimal location (OL) problem, where only the optimal locations are
returned from an infinite number of candidate locations, given a finite set of preexisting
facilities F. The problem is studied in $L_p$ space. Recently, Xiao et al. [23] extended the
problem to a spatial network setting, using network distance in place of $L_p$ distance.
Our optimal segment problem is related to the OL query, but uses route traversals
instead of static customer point locations. The techniques presented in these previous
works cannot be applied to solve the optimal segment problem.

7.2 Flow Intercepting Facility Location 

The flow intercepting facility location (FIFL) problem is similar to our problem in that 
it models demand by means of customer flows. Here, customer trips are pre-planned, 
and customers can choose to visit a facility or not during their trips by deviating from 
a pre-planned route. 






Hodgson [13, 14] was the first to identify and study an FIFL-type problem where
the placement of facilities minimizes the total deviation from preplanned trips made
by a population of customers. Later, Berman and collaborators investigated a variety
of versions of this problem: (i) the optimal location for discretionary facilities [5], (ii)
facility location given probabilistic flows [6], (iii) locating facilities with finite capacities [3],
(iv) locating facilities when the level of customer usage of a service depends
on the number of facilities they encounter along their path [2], and (v) locating competitive
facilities (demand and flow coverage problem) [4].

Our study differs from this existing work in important ways. We assume a realistic 
setting and propose efficient means of placing a facility on a road segment, considering 
existing facilities and customer movements derived from GPS data. Our framework 
enables the use of scoring functions that generate scores from customer traversals of 
routes, and it enables the use of models that distribute these scores to road segments. 
The framework is open to such functions and models and thus enables the modeling 
of a wide variety of scenarios. Our approach can easily be augmented to model the 
unavailability of locations in a spatial network, so that such locations are not considered 
in results. 
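
To make the roles of the scoring function, the distribution model, and the
unavailability filter concrete, the following minimal sketch aggregates route scores
on segments and returns the highest-scoring available segment(s). It is an
illustration under stated assumptions, not the AUG or ITE algorithm: the names
route_score, distribute_equally, and optimal_segments are hypothetical, the scoring
function simply counts traversals, the distribution model splits a score evenly,
and existing facilities are ignored.

```python
from collections import defaultdict


def route_score(traversal_count):
    # Hypothetical scoring function: one point per traversal of the route.
    return float(traversal_count)


def distribute_equally(score, segments):
    # Hypothetical distribution model: split the route score evenly among
    # the distinct segments covered by the route.
    unique = list(dict.fromkeys(segments))
    share = score / len(unique)
    return [(seg, share) for seg in unique]


def optimal_segments(routes, unavailable=frozenset(),
                     score_fn=route_score, distribute=distribute_equally):
    """routes: iterable of (segment_ids, traversal_count) pairs.

    Returns (sorted list of best segment ids, best score).
    """
    totals = defaultdict(float)
    for segments, count in routes:
        if not segments:
            continue  # a route that covers no segment contributes nothing
        for seg, share in distribute(score_fn(count), segments):
            totals[seg] += share
    # Unavailable locations are filtered out before the maximum is taken.
    candidates = {s: v for s, v in totals.items() if s not in unavailable}
    if not candidates:
        return [], 0.0
    best = max(candidates.values())
    return sorted(s for s, v in candidates.items() if v == best), best
```

Passing a different score_fn or distribute argument swaps the model without
changing the aggregation step, which is the kind of openness the framework aims for.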

8 Conclusions and Future Work 

The paper formalizes a modern version of the classical facility location problem that 
takes into account the availability of customer trajectory data that is constrained to a 
road network, rather than simply assuming the availability of static customer locations. 
In the resulting framework, route traversals by customers rather than customer loca- 
tions are attracted by facilities. The framework enables a wide variety of choices for 
assigning scores to the routes traversed by customers and for distributing these scores 
to segments in the underlying road network, thus offering flexibility that aims to enable 
applications with different types of facilities. We believe that this work provides a new 
and realistic generalization of the classical facility location problem. 

Two algorithms, AUG and ITE, are provided to solve this generalized problem. 
AUG takes a graph augmentation approach, and ITE iteratively partitions road seg- 
ments into smaller pieces (subsegments) while using a scoring mechanism to guide the 
selection of promising segments for further partitioning. The paper reports on empirical
studies with both real and synthetic routes map-matched to a real spatial network
that demonstrate the practicality of the proposed algorithms. Algorithm ITE outperforms
AUG thanks to its sophisticated pruning techniques that effectively reduce the search
space.

Several interesting directions for future work exist, including the following two. 
First, the optimal segments could be evaluated incrementally as new routes become
available. Incremental evaluation would offer more flexibility when new routes are
added continuously and may help improve performance. Second, future work may consider
finding the top-k segments.
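
As a rough illustration of these two directions, the sketch below keeps per-segment
score totals up to date as routes arrive and answers top-k queries from those totals.
It is a naive baseline under assumptions of our own, not an extension of AUG or ITE:
the class name IncrementalSegmentScores and its methods are hypothetical, each route
score is split evenly among the distinct segments of the route, and existing
facilities and pruning are ignored.

```python
import heapq
from collections import defaultdict


class IncrementalSegmentScores:
    """Hypothetical running totals of segment scores."""

    def __init__(self):
        self.totals = defaultdict(float)

    def add_route(self, segments, score=1.0):
        # Update only the segments covered by the new route; earlier
        # routes are not revisited.
        unique = list(dict.fromkeys(segments))
        if not unique:
            return
        share = score / len(unique)
        for seg in unique:
            self.totals[seg] += share

    def top_k(self, k):
        # Highest score first; ties are broken by segment id.
        return heapq.nsmallest(
            k, self.totals.items(), key=lambda kv: (-kv[1], kv[0])
        )
```

Each insertion touches only the segments of the added route, so its cost is
proportional to the route length rather than to the size of the route collection.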